Abstract



For scientific computations on a digital computer the set of real numbers is usually
approximated by a finite set F of "floating-point" numbers. We compare the numerical
accuracy possible with different choices of F having approximately the same range and
requiring the same word length. In particular, we compare different choices of base (or
radix) in the usual floating-point systems. The emphasis is on the choice of F, not on the
details of the number representation or the arithmetic, but both rounded and truncated
arithmetic are considered. Theoretical results are given, and some simulations of typical
floating-point computations (forming sums, solving systems of linear equations, finding
eigenvalues) are described. If the leading fraction bit of a normalized base 2 number is not
stored explicitly (saving a bit), and the criterion is to minimise the mean square roundoff
error, then base 2 is best. If unnormalized numbers are allowed, so the first bit must be
stored explicitly, then base 4 (or sometimes base 8) is the best of the usual systems.

Index Terms: Base, floating-point arithmetic, radix, representation error, rms error,
rounding error, simulation.

1 Introduction

A real number x is usually approximated in a digital computer by an element fl(x) of a finite
set F of "floating-point" numbers. We regard the elements of F as exactly representable real
numbers, and take fl(x) as the floating-point number closest to x. The definition of "closest",
rules for breaking ties, and the possibility of truncating instead of rounding are discussed later.

We restrict our attention to binary computers in which floating-point numbers are represented
in a word (or multiple word) of fixed length w bits, using some convenient (possibly
redundant) code. Usually F is a set of numbers of the form

$$ s\,\beta^{e} \sum_{i=1}^{t} d_i\,\beta^{-i} \tag{1.1} $$

where $\beta = 2^k > 1$ is the base (or radix), $t > 0$ is the number of digits, $s = \pm 1$ is a
sign, e is an exponent in some fixed range

$$ m < e \le M, \tag{1.2} $$

and each $d_i$ is a $\beta$-ary digit $0, 1, \ldots, \beta - 1$. Other possible floating-point number
systems (i.e., choices of F) are mentioned in Section 3.

Since the coding of the exponent e and the signed fraction $(s; d_1, \ldots, d_t)$ must fit into w
bits, there is a tradeoff between precision and range. (A discussion of precision and range
requirements for general scientific computing may be found in Cody [7, 8].) We do not consider
this tradeoff; instead we suppose that the range and word length are prescribed, and we study
the dependence of the precision on the base $\beta$.

With higher bases fewer bits are needed for the exponent, so more are available for the fraction
(see Section 2 for details). However, more leading fraction bits may be zero, so the best choice
of base is not immediately obvious. Our aim is to compare the attainable precision of systems
with different bases. Theoretical results are given in Sections 4 and 5, and some simulations are
described in Sections 6 and 7. The conclusions are summarised in Section 8.

Since we are interested in the precision attainable with different number systems, we assume
that the arithmetic is the best possible. In other words, if $x, y \in F$, and $\circ$ is an arithmetic
operation, we assume that $x \circ y$ is found to sufficient accuracy to give the correct (rounded)
result $\mathrm{fl}(x \circ y)$. Ensuring this may be too expensive in practice, but our conclusions should be
valid provided several guard units are used when computing $\mathrm{fl}(x \circ y)$. The reduction in precision
caused by using only a small number of guard digits is discussed by Kuki and Cody [18].



2 The Usual Systems 

A floating-point number of the form (1.1) may be written as

$$ s\,2^{ke} \sum_{i=1}^{u} b_i\,2^{-i} \tag{2.1} $$

where $b_{(i-1)k+1} \cdots b_{ik}$ is the binary form of the $2^k$-ary digit $d_i$, and $u = kt$ is the number
of bits required to code $d_1, \ldots, d_t$. We use (2.1) in preference to (1.1), and do not insist that
t must be an integer. The details of the coding of the exponent e and the signed fraction
$(s; b_1, \ldots, b_u)$ in a w-bit word do not concern us.

The representation (2.1) is said to be normalised if at least one of $b_1, \ldots, b_k$ is nonzero.
From (2.1) and the bound (1.2) on e, the largest and smallest floating-point numbers having a
normalised representation are

$$ f_{\max} = 2^{kM} (1 - 2^{-u}) \tag{2.2} $$

and

$$ f_{\min} = 2^{km}, \tag{2.3} $$

respectively. If the range R of the system is defined to be $\log_2(f_{\max}/f_{\min})$ then, neglecting the
term $2^{-u}$ in (2.2),

$$ k(M - m) = R. \tag{2.4} $$

Thus, for systems with the same range, $k(M - m)$ is invariant.

Goldberg [10], McKeeman [21], and others have observed that with base 2 the leading fraction
bit $b_1$ can be implicit, provided only normalized representations of nonzero numbers are allowed
and a special exponent is reserved for zero. Define

$$ p = \begin{cases} 2, & \text{if this "implicit-first-bit" idea is used} \\ 1, & \text{otherwise} \end{cases} \tag{2.5} $$

so $u - \log_2 p$ bits are required to code the fraction $(b_1, \ldots, b_u)$. One bit is required for the
sign, and at least $\lceil \log_2(M - m) \rceil$ for the exponent. Thus

$$ u - \log_2 p + 1 + \lceil \log_2(M - m) \rceil \le w. \tag{2.6} $$

For a sensible design, equality will hold in (2.6), and $M - m$ will be a power of two (or one less
if the exponent is coded in one's complement or a special exponent is reserved for zero, but such
minor differences are unimportant). Thus (2.4) gives

$$ 2^{-u} k p = 2^{1-w} R. \tag{2.7} $$

The right side depends only on the word length and the range, so (2.7) gives a useful relation
between the fraction length u and the base $\beta = 2^k$.
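As an illustration of relation (2.7), the following minimal sketch solves it for the fraction length u given the word length, range, and base. The function name `fraction_bits` and its argument names are ours, introduced only for this example; the values w = 32 and R = 512 are those used for the simulations of Section 6.

```python
import math


def fraction_bits(w: int, R: float, k: int, p: int = 1) -> float:
    """Fraction length u from relation (2.7): 2**(-u) * k * p == 2**(1 - w) * R.

    w: word length in bits, R: range log2(f_max / f_min),
    k: log2 of the base, p: 2 if the first fraction bit is implicit, else 1.
    """
    if p == 2 and k != 1:
        raise ValueError("an implicit first bit (p = 2) only makes sense for base 2")
    return w - 1 - math.log2(R / (k * p))


if __name__ == "__main__":
    # Word length 32 and range 512, as in the simulations of Section 6.
    for k, p in [(1, 2), (1, 1), (2, 1), (3, 1), (4, 1), (8, 1)]:
        u = fraction_bits(32, 512, k, p)
        print(f"base {2 ** k:3d}, p = {p}: u = {u:.3f} bits")
```

For base 8 the result is not an integer, which reflects the remark above that M - m need not be exactly a power of two in a real design.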

Many different systems of the class described here have actually been used. They include,
with various word lengths, ranges and rounding (or truncating) rules:

    $\beta = 2$, $p = 2$ (e.g., PDP 11-45);
    $\beta = 2$, $p = 1$ (e.g., CDC 6400);
    $\beta = 4$ (e.g., Illiac II);
    $\beta = 8$ (e.g., Burroughs 5500); and
    $\beta = 16$ (e.g., IBM 360).



In some machines, bases other than $\beta$ are used in the arithmetic unit. For example, in the
ILLIAC III (Atkins [2]) multiplication and division are performed with base 256, but numbers are
stored with base 16.



3 Other Systems 

Morris [22] suggests using "tapered" systems in which the division of bits between the exponent 
and the fraction depends on the exponent. The idea is to have a longer fraction for the (commonly 
occurring) numbers with exponents close to zero than for numbers with large exponents. We do 
not consider these interesting systems here. 

Brown and Richman [5] assume that floating-point numbers are represented in a computer
word with two sign bits and a fixed number of q-state devices, for some fixed $q \ge 2$, and they
compare bases of the form $q^k$. Although the results of Sections 4 and 5 can be generalized easily
to cover their assumptions, we restrict ourselves to q = 2, for this is the only case of practical
importance.

Finally, we describe a "logarithmic" system that is interesting for theoretical reasons (see
Section 4), although it is impractical (because of the difficulty of performing floating-point
additions). Let a and b be positive integers which, together with the word length w, characterize
the system. The floating-point numbers are zero and all nonzero real numbers x such that
$a \log_2 |x| + b$ is one of the integers $1, 2, \ldots, 2^{w-1} - 1$. If

$$ \lambda(x) = \begin{cases} 0, & \text{if } x = 0 \\ \mathrm{sign}(x)\,(a \log_2 |x| + b), & \text{if } x \ne 0 \end{cases} \tag{3.1} $$

then the floating-point number x may be represented in a computer word by a convenient code
for the integer $\lambda(x)$. Since

$$ x = \begin{cases} 0, & \text{if } \lambda = 0 \\ \mathrm{sign}(\lambda) \cdot 2^{(|\lambda| - b)/a}, & \text{if } \lambda \ne 0 \end{cases} \tag{3.2} $$

the largest and smallest positive floating-point numbers are

$$ f_{\max} = 2^{(2^{w-1} - 1 - b)/a} \tag{3.3} $$

and

$$ f_{\min} = 2^{(1 - b)/a} \tag{3.4} $$

respectively, and the range $\log_2(f_{\max}/f_{\min})$ is

$$ R = (2^{w-1} - 2)/a. \tag{3.5} $$

For example, taking $a = 2^{w-10}$ and $b = 2^{w-2}$ gives $f_{\max} \approx 2^{256}$, $f_{\min} \approx 2^{-256}$, and $R \approx 512$.
If x and y are positive floating-point numbers with $f_{\min} \le xy \le f_{\max}$ then, from (3.1),

$$ \lambda(xy) = \lambda(x) + \lambda(y) - b. \tag{3.6} $$

Thus, floating-point multiplication and division are easy to perform in a logarithmic system, 
and do not introduce any rounding errors. Unfortunately, there does not seem to be any easy 
way to perform floating-point addition. 
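The logarithmic system can be sketched in a few lines. The following is an illustrative model only: the word length W = 16 and the names `encode`, `decode` and `multiply` are assumptions made for this example, and "closest" is taken in the logarithmic sense (the code nearest to a log2|x| + b). The extension of (3.6) to negative numbers by handling the sign separately is also our own convenience.

```python
import math

W = 16                        # word length (hypothetical, chosen to keep numbers small)
A = 2 ** (W - 10)             # the parameter a of Section 3
B = 2 ** (W - 2)              # the parameter b of Section 3
LAMBDA_MAX = 2 ** (W - 1) - 1


def encode(x: float) -> int:
    """Integer code lambda of the logarithmic number nearest to x, as in (3.1)."""
    if x == 0:
        return 0
    lam = round(A * math.log2(abs(x)) + B)
    if not 1 <= lam <= LAMBDA_MAX:
        raise OverflowError("x is outside the range of the system")
    return lam if x > 0 else -lam


def decode(lam: int) -> float:
    """Real number represented by the code lambda, as in (3.2)."""
    if lam == 0:
        return 0.0
    magnitude = 2.0 ** ((abs(lam) - B) / A)
    return magnitude if lam > 0 else -magnitude


def multiply(lx: int, ly: int) -> int:
    """Code of the exact product, using only integer addition as in (3.6)."""
    if lx == 0 or ly == 0:
        return 0
    lam = abs(lx) + abs(ly) - B
    if not 1 <= lam <= LAMBDA_MAX:
        raise OverflowError("product is outside the range of the system")
    return lam if (lx > 0) == (ly > 0) else -lam


if __name__ == "__main__":
    product = multiply(encode(2.0), encode(-4.0))
    print(product, decode(product))
```

Multiplication introduces no rounding error once the operands are encoded; the only error is the one made by `encode` itself.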



4 The Worst Case Relative-Error Criterion 

One measure of the precision of a floating-point number system is the worst relative error $\epsilon$
made in approximating a real number x (not too large or small) by fl(x), i.e.,

$$ \epsilon = \sup \left| \frac{x - \mathrm{fl}(x)}{x} \right|. \tag{4.1} $$

The "worst case relative-error" criterion is simply to choose a number system (with the
prescribed R and w) to minimise $\epsilon$.

For the logarithmic systems described in Section 3, we see from (3.2) that

$$ \epsilon = 2^{1/(2a)} - 1 = \frac{\log 2}{2a}. \tag{4.2} $$

(Here and later we neglect terms of order $\epsilon^2$ or $2^{-2u}$, and logarithms are natural unless otherwise
indicated.) If

$$ \epsilon_0 = R\,2^{-w} \log 2 \tag{4.3} $$

then (3.5) and (4.2) give

$$ \epsilon = \epsilon_0. \tag{4.4} $$

Now consider any floating-point number system with range R and word length w. If $\epsilon$ is
defined by (4.1), then

$$ \epsilon \ge \epsilon_0. \tag{4.5} $$

(In a logarithmic system, the logarithms of positive floating-point numbers are uniformly spaced
and all bit patterns are used.) Thus we use logarithmic systems as a standard of comparison for
other, more practical, systems.

Wilkinson [28] shows that

$$ \epsilon = 2^{k-u-1} \tag{4.6} $$

for the number systems of Section 2. From (2.7), (4.3) and (4.6),

$$ \frac{\epsilon}{\epsilon_0} = \frac{2^k}{k p \log 2} = f_1(k, p) \tag{4.7} $$

which shows how much $\epsilon$ exceeds the best possible value $\epsilon_0$ for a number system with the same
R and w. Table 1 gives $f_1(k, p)$ for $k = 1, 2, \ldots, 8$.

TABLE 1

THEORETICAL WORST CASE AND RMS ERRORS*

    k    p    β = 2^k    f_1(k,p)    f_2(k,p)
    1    2        2        1.44        1.06
    1    1        2        2.89        2.12
    2    1        4        2.89        1.68
    3    1        8        3.85        1.87
    4    1       16        5.77        2.45
    5    1       32        9.23        3.51
    6    1       64        15.4        5.34
    7    1      128        26.4        8.47
    8    1      256        46.2        13.9

* See (4.7) and (5.8) for definitions of $f_1$ and $f_2$.

The table shows that the implicit-first-bit base 2 systems are the best of those described in
Section 2, and close to the best possible, on the worst case criterion. Of the explicit-first-bit
systems, base 2 and base 4 are equally good. This may be explained as follows. Changing from
base 2 to base 4 frees a bit from the exponent for the fraction. If the first 4-ary digit $d_1$ is 2 or 3,
the first fraction bit $b_1$ is 1, and the extra fraction bit may increase the precision. However, if $d_1$
is 1 then $b_1$ is 0, and the bit gained is wasted. (According to Richman [24], Goldberg observed
this independently.) If fl(x) is defined by truncation rather than rounding then $\epsilon$ is doubled, but
the comparison between different bases is not changed.
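The worst case column of Table 1 can be recomputed directly from (4.7). The short sketch below does this; the function name `f1` follows the notation of the text, while the loop and output format are ours.

```python
import math


def f1(k: int, p: int = 1) -> float:
    """Ratio epsilon / epsilon_0 of (4.7) for base 2**k; p = 2 means implicit first bit."""
    return 2 ** k / (k * p * math.log(2))


if __name__ == "__main__":
    print(" k  p  base   f1")
    print(f" 1  2  {2:4d}  {f1(1, 2):5.2f}")
    for k in range(1, 9):
        print(f"{k:2d}  1  {2 ** k:4d}  {f1(k, 1):5.2f}")
```

The equality of f1(1, 1) and f1(2, 1) is the "equally good" statement made above for explicit base 2 and base 4.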

5 The RMS Relative-Error Criterion 

Consider forming the product of nonzero floating-point numbers $x_0, \ldots, x_n$ (in one of the usual
systems) by n floating-point multiplications, i.e., define $p_0 = x_0$ and $p_i = \mathrm{fl}(p_{i-1} x_i)$ for
$i = 1, \ldots, n$. If $\delta_i = (p_{i-1} x_i - p_i)/(p_{i-1} x_i)$ is the relative error made in forming the ith product,
then the relative error in the final result is

$$ \Delta = \frac{x_0 \cdots x_n - p_n}{x_0 \cdots x_n} = 1 - \prod_{i=1}^{n} (1 - \delta_i) = \sum_{i=1}^{n} \delta_i + \text{higher order terms}. \tag{5.1} $$

Thus

$$ |\Delta| \le n\epsilon \tag{5.2} $$

where $\epsilon$ is defined by (4.1), and we have neglected a term of order $n^2 \epsilon^2$. Many other bounds on
the rounding errors in algebraic processes are also of the form $f(n)\epsilon$ (see Wilkinson [28], [29]),
which is a good reason for choosing a floating-point number system according to the worst case
criterion of Section 4. However, the bound (5.2) is rather pessimistic, for the individual rounding
errors $\delta_i$ in (5.1) usually tend to cancel rather than to reinforce each other. (We are assuming
an unbiased rounding rule as described in Section 6. With truncation or biased rounding the
bound (5.2) may be realistic.)

If the $\delta_i$ were independent random variables, distributed with mean 0 and variance $\sigma_i^2$, then
$\Delta$ would be distributed with mean 0 and variance $\sum_{i=1}^{n} \sigma_i^2$. Thus a reasonable probabilistic
measure of the precision of a floating-point number system is the root-mean-square (rms) value
$\delta_{\mathrm{rms}}$ of $\delta = (x - \mathrm{fl}(x))/x$, where x is distributed like the nonzero results of arithmetic operations
performed during a typical floating-point computation.

The simulations described in Sections 6 and 7 suggest that the rms rounding error in
floating-point computations involving many arithmetic operations is often roughly proportional to $\delta_{\mathrm{rms}}$
(see also Weinstein [27]). Thus we prefer $\delta_{\mathrm{rms}}$ to other probabilistic measures of precision such as
the expected value of $|\delta|$ (McKeeman [21]), the expected value of $\log_2 |\delta|$ (Kuki and Cody [18]),
and the expected error in "units in the last place" (Kahan [15]). We disregard errors in the
conversion from internal floating-point results to decimal output, for the rms value of these
errors depends on the number of decimal places rather than on the internal number system.
(For the effect of repeated conversions back and forth, see Matula [20].)

What distribution should we assume for the nonzero real numbers x that are to be approxi- 
mated by floating-point numbers? Hamming [11], Knuth [17], and others argue that we should 
assume that $\log |x|$ is uniformly distributed. There are two reasons why this assumption is only
an approximation. Although $\log |x|$ may be approximately uniform locally, it is certainly not
uniform on the entire interval $[\log f_{\min}, \log f_{\max}]$. Also, the fine structure of the distribution is
not uniform, for the numbers arising from multiplications or (more importantly) additions of 
floating-point numbers are really discrete rather than continuous variables. Nevertheless, we 
shall make Hamming's assumption in this section. It is certainly a much better approximation 
than assuming that x is uniformly distributed on some interval. 

For the logarithmic systems, $\delta$ is uniformly distributed on $[-\epsilon_0, \epsilon_0]$, where $\epsilon_0$ is given by
(4.3). Thus $\delta_{\mathrm{rms}} = \delta_0$, where

$$ \delta_0 = \frac{\epsilon_0}{\sqrt{3}} = \frac{R\,2^{-w} \log 2}{\sqrt{3}}. \tag{5.3} $$

Because the assumption that $\log |x|$ is uniform is only an approximation, there is no result
corresponding to the inequality (4.5), but the logarithmic systems still provide a convenient
standard of comparison for other, more practical, systems.

For the systems of Section 2, there is no loss of generality in assuming that x lies in $[1/\beta, 1)$
and (by our assumption) $\log_\beta x$ is uniformly distributed on $[-1, 0)$. Consider numbers y
distributed uniformly on a small interval near x. The absolute error $y - \mathrm{fl}(y)$ is approximately
uniform on $(-2^{-u-1}, 2^{-u-1})$. (It is certainly not logarithmically distributed, as is assumed to derive
(18') in Benschop and Ratz [3].) Hence $\alpha = (y - \mathrm{fl}(y))/y$ is uniform on $(-2^{-u-1}/x, 2^{-u-1}/x)$,
and has probability density function (Feller [9])

$$ g_x(\alpha) = \begin{cases} 2^u x, & \text{if } |\alpha| < 2^{-u-1}/x \\ 0, & \text{otherwise.} \end{cases} \tag{5.4} $$

Integrating over the interval $[1/\beta, 1)$, we see that $\delta$ is distributed with density

$$ f(\delta) = \int_{1/\beta}^{1} g_x(\delta) \, d\log_\beta x \tag{5.5} $$

$$ = \begin{cases} 2^u (1 - 2^{-k})/(k \log 2), & \text{if } |\delta| < 2^{-u-1} \\ \left( |\delta|^{-1}/2 - 2^{u-k} \right)/(k \log 2), & \text{if } 2^{-u-1} \le |\delta| < 2^{k-u-1} \\ 0, & \text{otherwise.} \end{cases} \tag{5.6} $$

It is easy to find the expected value of $\delta$, $\delta^2$, $|\delta|$, $\log_2 |\delta|$, etc. from (5.6). In particular, we find
that $\delta$ is distributed with standard deviation

$$ \delta_{\mathrm{rms}} = 2^{-u} \left( \frac{2^{2k} - 1}{24\,k \log 2} \right)^{1/2} \tag{5.7} $$

and mean 0. (The mean is actually of order $2^{-2u}$, but terms of this order have been neglected.)
From (2.7), (5.3) and (5.7),

$$ \frac{\delta_{\mathrm{rms}}}{\delta_0} = \frac{1}{k p \log 2} \left( \frac{2^{2k} - 1}{2 k \log 2} \right)^{1/2} = f_2(k, p) \tag{5.8} $$

and the last column of Table 1 gives $f_2(k, p)$ for $k = 1, \ldots, 8$. The table shows that the
implicit-first-bit base 2 systems are the best of the systems of Section 2 (and only 6 per cent worse than
the logarithmic systems) on the rms relative-error criterion. Base 4 (closely followed by base 8)
is best in the explicit-first-bit systems. The reason why base 4 is better than explicit base 2 is
apparent from the discussion at the end of Section 4: $|\delta|$ is never greater for base 4 than for
explicit base 2, and sometimes it is smaller. A similar argument shows that implicit base 2 is
better than base 4.

Because of the different ranges possible with base 4 and base 8, there are some choices of 
minimal acceptable range for which base 8 is preferable to base 4, but bases higher than 8 are 
always inferior to base 4 on the rms relative-error criterion. 
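The rms criterion of this section can be checked numerically. The sketch below draws x with $\log_\beta x$ uniform on (-1, 0], rounds it to u fraction bits, and compares the observed rms relative error with the closed form $2^{-u} ((2^{2k} - 1)/(24 k \log 2))^{1/2}$, which follows from the uniform-error argument above. It also evaluates the corresponding ratio to the logarithmic system, which reproduces the last column of Table 1. The function names, the choice u = 12, the sample size and the seed are assumptions made for this illustration, and Python's built-in round (ties to even) stands in for the rounding rule, since ties have negligible probability here.

```python
import math
import random


def predicted_rms(k: int, u: int) -> float:
    """Closed-form rms relative rounding error for base 2**k with u fraction bits."""
    return 2.0 ** -u * math.sqrt((4 ** k - 1) / (24 * k * math.log(2)))


def f2(k: int, p: int = 1) -> float:
    """Ratio of the rms error to that of the logarithmic system (last column of Table 1)."""
    return math.sqrt((4 ** k - 1) / (2 * k * math.log(2))) / (k * p * math.log(2))


def simulated_rms(k: int, u: int, samples: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the rms of (x - fl(x)) / x under the log-uniform assumption."""
    rng = random.Random(seed)
    beta = 2.0 ** k
    scale = 2.0 ** u
    total = 0.0
    for _ in range(samples):
        x = beta ** (-rng.random())          # log_beta(x) is uniform on (-1, 0]
        fl_x = round(x * scale) / scale      # round to u fraction bits
        delta = (x - fl_x) / x
        total += delta * delta
    return math.sqrt(total / samples)


if __name__ == "__main__":
    u = 12
    for k in (1, 2, 3, 4):
        sim, pred = simulated_rms(k, u), predicted_rms(k, u)
        print(f"base {2 ** k:2d}: simulated {sim:.3e}, predicted {pred:.3e}, f2 = {f2(k):.2f}")
```

A small u is used so that the rounding error is far above the resolution of the double-precision numbers used to carry out the experiment.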

6 Simulation of Different Systems 

Three classes of floating-point computations were run, using various number systems with w = 32
and $R \approx 512$ (the same as for single-precision on the IBM 360 and many other computers). The
systems were a logarithmic system $S_0$ with $a = 2^{22}$ and $b = 2^{30}$ (see Section 3), and the following
examples of the systems described in Section 2.

    $S_1$:   $\beta = 2$, $u = 23$, $p = 2$ (base 2 with a 23-bit fraction, the first bit implicit).
    $S_2$:   $\beta = 4$, $u = 23$ (base 4 with 23 bits or 11 1/2 digits).
    $S_3$:   $\beta = 2$, $u = 22$, $p = 1$ (base 2 with 22 bits, all explicit).
    $S_4$:   $\beta = 16$, $u = 24$ (base 16 with 24 bits or 6 digits).
    $S_4'$:  The same as $S_4$ with truncation (towards zero) rather than rounding.
    $S_5$:   $\beta = 256$, $u = 25$ (base 256 with 25 bits or 3 1/8 digits).

The rounding rule for systems $S_1$ to $S_5$ is the "R*-mode" of Kuki and Cody [18]: fl(x) is
defined to be the floating-point number closest to x, and ties are broken by choosing fl(x) so
that its least significant fraction bit is one. Formally, if x is a nonzero real number with binary
expansion

$$ x = s\,2^{ke} \sum_{i=1}^{\infty} b_i\,2^{-i} \tag{6.1} $$

(taking the terminating expansion if there is one, normalizing so that one of $b_1, \ldots, b_k$ is nonzero,
and neglecting the possibility of underflow or overflow), then

$$ \mathrm{fl}(x) = \begin{cases} s\,2^{ke} \sum_{i=1}^{u} b_i\,2^{-i}, & \text{if } b_{u+1} = 0 \text{ or } \sum_{i=0}^{\infty} b_{u+i}\,2^{-i} = \tfrac{3}{2} \\ s\,2^{ke} \left( 2^{-u} + \sum_{i=1}^{u} b_i\,2^{-i} \right), & \text{otherwise.} \end{cases} \tag{6.2} $$

The special case $b_u b_{u+1} b_{u+2} \cdots = 11000\cdots$ is quite important, for it often occurs when x is the
result of a floating-point addition, and neglecting it can lead to bias in the rounding.

All the floating-point number systems were simulated on an IBM 360/91 computer, with
arithmetic operations performed in double precision ($\beta = 16$, $u = 56$) before rounding or
truncating appropriately. Thus, the number of guard units used was effectively infinite. The data
were pseudorandom double-precision numbers distributed as described in Section 7, and "exact"
results were computed using double precision throughout.
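The tie-breaking part of the rounding rule described above is easy to get wrong, so a small integer model may help. In the sketch below a fraction is held as a nonnegative integer with some extra low-order bits, and `round_rstar` (a name chosen for this example) drops those bits, rounding to nearest and resolving ties so that the least significant retained bit is one. This is a model of the rule only, not of the original simulation code.

```python
def round_rstar(n: int, drop: int) -> int:
    """Drop the `drop` low-order bits of the nonnegative integer n.

    Rounds to nearest; in a tie the result whose least significant bit is one
    is chosen, which is the tie rule described in the text.
    """
    if drop <= 0:
        return n
    kept = n >> drop                      # retained bits, i.e. b_1 ... b_u
    rest = n & ((1 << drop) - 1)          # discarded bits, i.e. b_{u+1} b_{u+2} ...
    half = 1 << (drop - 1)
    if rest > half or (rest == half and kept % 2 == 0):
        kept += 1                         # round up; in a tie only when kept is even
    return kept


if __name__ == "__main__":
    # 0b1011 with one bit dropped is the tie "...1 1 000...": truncated to 0b101.
    print(bin(round_rstar(0b1011, 1)))
    # 0b1001 with one bit dropped is the tie "...0 1 000...": rounded up to 0b101.
    print(bin(round_rstar(0b1001, 1)))
```

Summed over a full block of inputs the signed errors cancel exactly, which is the sense in which the rule is unbiased.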

Forming sums, solving systems of linear equations, and finding the eigenvalues of symmetric 
matrices were the chosen classes of floating-point computations. They appear to be fairly typical 
of computations in which the effect of rounding errors may be important. Details, and the results 
of the simulations, are given in Section 7. Other classes that have been considered include 
solving ordinary differential equations (Henrici [12], [13], Hull and Swenson [14]), fast Fourier 
transforms (Kaneko and Liu [16], Ramos [23], and Weinstein [27]), matrix iterative processes 
(Benschop and Ratz [3]), solving positive-definite linear systems (Tienari [25]), and forming 
products (Section 5). 



7 Details and Results of the Simulations 

Sums 

Let m and n be positive integers. A number z was drawn from a uniform distribution on
[0, 1], then numbers $x_1, \ldots, x_n$ were drawn independently from a uniform distribution on $[-Z, Z]$,
where $Z = 256^z$ is a scale factor used to avoid a bias in favour of any of the number systems
(see Kuki and Cody [18]). The approximate sums $s_j$ of $\mathrm{fl}(x_1), \ldots, \mathrm{fl}(x_n)$ were accumulated, in
the usual way, with each of the number systems $S_j$ described in Section 6, and the errors

$$ \alpha_j = \frac{\left| s_j - \sum_{i=1}^{n} x_i \right|}{\sum_{i=1}^{n} |x_i|} \tag{7.1} $$

were found. (The denominator is used in preference to $\left| \sum_{i=1}^{n} x_i \right|$ to ensure that $\alpha_j$ is small.) The
procedure was repeated m times and the rms values $\beta_j$ of the $\alpha_j$ were found. For purposes
of comparison between the systems, it is convenient to consider the normalized rms errors
$\gamma_j = \beta_j/\beta_0$. (Recall that $\beta_0$ is the rms error for the logarithmic system $S_0$.)

Table 2 gives $\gamma_j$ for various choices of m and n. If the $\alpha_j$ are considered as random variables
drawn from a distribution with mean square $B_j^2$, then $\beta_j$ and $\gamma_j$ may be regarded as estimates
of $B_j$ and $B_j/B_0$, respectively. The number of repetitions m was chosen large enough to ensure
that the standard error of the estimates $\gamma_j$ given in Tables 2-4 is less than five units in the last
decimal place.

For n = 1 we are merely estimating the rms relative error in approximating $x_1$ by $\mathrm{fl}(x_1)$,
and the results agree with the predictions of Section 5 (see the last column of Table 1). Except
for $S_4'$, the effect of varying n is small, and does not affect the ranking of the systems.

It may be shown that the accumulated rounding error is

$$ \begin{cases} O(n^{3/2}), & \text{for } S_4' \\ O(n), & \text{for the other systems} \end{cases} \tag{7.2} $$

so it is not surprising that $\gamma_4'$ appears to grow like $n^{1/2}$. (The same applies if truncation is
downwards instead of towards zero.) Results for sums of positive numbers are similar, although
$B_j$ is larger by a factor of order $n^{1/2}$ for all the systems.
TABLE 2

RESULTS FOR SUMS

      n    m/1000     γ_1     γ_2     γ_3     γ_4     γ_4'    γ_5
      1     1000     1.06    1.68    2.12    2.45    4.89    13.9
      2      100     1.11    1.68    2.23    2.38    5.53    13.4
      4      100     1.13    1.69    2.25    2.36    6.33    13.2
      8      100     1.12    1.69    2.24    2.36    7.95    13.2
     10      100     1.12    1.69    2.23    2.36    8.76    13.4
     16       10     1.11    1.72    2.22    2.37    10.9    13.3
     32       10     1.09    1.71    2.18    2.39    15.9    13.6
     64       10     1.08    1.67    2.14    2.43    22.4    13.9
    100       30     1.06    1.68    2.13    2.41    28.1    13.6
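The sums experiment can be imitated on a modern machine. The sketch below is not the original simulation: it uses IEEE double precision as the "exact" arithmetic (the paper used base 16 with 56 bits), Python's random module for the data, and hypothetical function names `fl` and `rms_sum_error`. It rounds every operand and every partial sum to base $2^k$ with u fraction bits, either with the tie-to-odd rounding rule of Section 6 or by truncation towards zero, and returns the rms of the error measure (7.1).

```python
import math
import random


def fl(x: float, k: int, u: int, truncate: bool = False) -> float:
    """Nearest (or truncated) number with u fraction bits in base 2**k."""
    if x == 0.0:
        return 0.0
    _, E = math.frexp(abs(x))           # abs(x) = m * 2**E with 0.5 <= m < 1
    e = -(-E // k)                      # ceil(E / k): 2**(k*(e-1)) <= |x| < 2**(k*e)
    quantum = 2.0 ** (k * e - u)        # weight of the last fraction bit
    scaled = abs(x) / quantum           # exact, since quantum is a power of two
    r = math.floor(scaled)
    frac = scaled - r
    if not truncate and (frac > 0.5 or (frac == 0.5 and r % 2 == 0)):
        r += 1                          # round up; ties go to the odd neighbour
    return math.copysign(r * quantum, x)


def rms_sum_error(k: int, u: int, truncate: bool, n: int, trials: int, seed: int = 0) -> float:
    """rms over `trials` repetitions of the error measure (7.1) for sums of n terms."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        Z = 256.0 ** rng.random()       # scale factor Z = 256**z
        xs = [rng.uniform(-Z, Z) for _ in range(n)]
        s = 0.0
        for x in xs:
            s = fl(s + fl(x, k, u, truncate), k, u, truncate)
        alpha = abs(s - math.fsum(xs)) / math.fsum(abs(x) for x in xs)
        total += alpha * alpha
    return math.sqrt(total / trials)


if __name__ == "__main__":
    for n in (1, 16, 64):
        rounded = rms_sum_error(4, 24, False, n, trials=500)
        truncated = rms_sum_error(4, 24, True, n, trials=500)
        print(f"n = {n:3d}: truncation / rounding rms ratio = {truncated / rounded:.2f}")
```

With the same seed both systems see the same data, so the printed ratio is directly comparable to the ratio of the $S_4'$ and $S_4$ columns of Table 2, though exact agreement should not be expected from so few repetitions.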



Solving Systems of Linear Equations 

Numbers $z_1$ and $z_2$ were drawn independently from a uniform distribution on [0, 1], giving scale
factors $Z_1 = 256^{z_1}$ and $Z_2 = 256^{z_2}$. Numbers $a_{p,q}$ ($p, q = 1, \ldots, n$) were drawn independently from
a uniform distribution on $[-Z_1, Z_1]$; and $x_1, \ldots, x_n$ were drawn similarly from $[-Z_2, Z_2]$. For
each of the number systems $S_j$, let $A^{(j)} = (\mathrm{fl}(a_{p,q}))$, $A = (a_{p,q})$, $x = (x_p)$, $b = (b_p) = Ax$, and
$b^{(j)} = (\mathrm{fl}(b_p))$. The system of equations

$$ A^{(j)} y = b^{(j)} \tag{7.3} $$

was solved by Gaussian elimination with complete pivoting, giving the approximate solution
$y^{(j)}$, and the error

$$ \alpha_j = \frac{\| A y^{(j)} - b \|_2}{\| A \|_E \, \| x \|_2} \tag{7.4} $$

was computed. (Here $\|A\|_E = \left( \sum_{p=1}^{n} \sum_{q=1}^{n} a_{p,q}^2 \right)^{1/2}$. From results of Wilkinson [28], [29], $\alpha_j$ is
small even if A is rather ill conditioned.) The procedure was repeated m times, the rms values
$\beta_j$ of the $\alpha_j$ were computed, and the ratios $\gamma_j = \beta_j/\beta_0$ were found. The results for various m
and n are given in Table 3.



TABLE 3

RESULTS FOR SYSTEMS OF LINEAR EQUATIONS

      n    m/1000     γ_1     γ_2     γ_3     γ_4     γ_4'    γ_5
      1      100     1.30    2.06    2.61    2.99    4.92    17.0
      2      100     1.30    2.01    2.59    2.90    5.33    16.3
      4       10     1.27    1.97    2.56    2.80    5.63    15.7
      8        4     1.23    1.89    2.45    2.65    6.1     14.9
     16        1     1.18    1.82    2.35    2.60    7.1     14.4

Multiplication and division are performed exactly in a logarithmic system, so $\beta_0$ is less than
would otherwise be expected, and $\gamma_1, \ldots, \gamma_5$ are higher than for sums, especially for small values
of n. However, the ratios of $\gamma_1, \ldots, \gamma_5$ are much the same as for sums, and the ranking of the
systems is preserved. Results for positive $a_{p,q}$ and/or $x_p$ are similar.

It is interesting that $\gamma_4' < 2\gamma_4$ for n = 1 and 2. When n = 1 and $S_4'$ is used, the errors made
in forming $\mathrm{fl}(a_{1,1})$ and $\mathrm{fl}(b_1)$ tend to cancel when $\mathrm{fl}(b_1)/\mathrm{fl}(a_{1,1})$ is computed. Presumably there
is a similar, though less marked, effect for n > 1.



Finding Eigenvalues of Symmetric Matrices 

Numbers $a_{p,q}$ ($1 \le p \le q \le n$) were drawn independently from a uniform distribution on
$[-Z, Z]$, where Z was a scale factor chosen as above. The other elements of $A = (a_{p,q})$ were
defined by symmetry. For each number system $S_j$, the approximate eigenvalues
$\lambda_1^{(j)} \le \cdots \le \lambda_n^{(j)}$ of $A^{(j)} = (\mathrm{fl}(a_{p,q}))$ were computed by reducing $A^{(j)}$ to tridiagonal form
and then using the QR algorithm (Wilkinson [29]). We used translations of the Algol 60
procedures TRED1 (Martin et al [19]) and TQL1 (Bowdler et al [4]), except for some trivial
modifications to avoid unnecessary rounding errors when n = 2. The stopping criterion for the
QR algorithm was the same for
all number systems. (The parameters macheps and tol of the procedures were set to 10""^ and 
10~^° , respectively.) The errors 



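$$ \alpha_j = \left[ \sum_{i=1}^{n} \left( \lambda_i - \lambda_i^{(j)} \right)^2 \right]^{1/2} \Big/ \, \|A\|_E \tag{7.5} $$

were computed. (Here $\lambda_1 \le \cdots \le \lambda_n$ are the exact eigenvalues of A.) The procedure was
repeated m times, the rms values $\beta_j$ of the $\alpha_j$ computed, and the ratios $\gamma_j = \beta_j/\beta_0$ found, as
above. The results are given in Table 4.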



TABLE 4

RESULTS FOR EIGENVALUES OF SYMMETRIC MATRICES

      n    m/1000     γ_1     γ_2     γ_3     γ_4     γ_4'    γ_5
      2      100     1.07    1.61    2.14    2.38    6.06    15.2
      4       10     1.33    2.24    2.65    3.60    10.5    25.8
      8        3     1.14    2.01    2.34    3.73    10.8    29.6
     16        1     1.00    1.82    1.99    3.49    10.7    28.8

The method used for finding eigenvalues depends heavily on multiplications by matrices of
the form $\begin{pmatrix} c & s \\ -s & c \end{pmatrix}$, where $c^2 + s^2 = 1$. The numbers c and s are certainly not distributed
as assumed in Section 5. This, along with other observations made above, may explain the
interesting variations in the $\gamma_j$. Despite these variations, the ranking of the different systems is
as predicted in Section 5.



8 Conclusions 

Comparing $\gamma_4$ with $\gamma_4'$ in Tables 2-4 shows that the rms error for truncation is usually
considerably more than twice as much as for rounding. However, truncation is often preferred because
the usual implementation of rounding requires an extra carry propagation. An interesting com- 
promise is the "von Neumann round" (Burks et al [6], Urabe [26]), for which the result of an 
arithmetic operation is truncated, and then the least significant bit is set to one. (An exception 
could be made if the result is exactly representable; this would involve checking if the truncated 
bits were all zero.) No extra carry propagation is required, and the rms error is twice that for 
normal rounding, so considerably better than for truncation. 
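The claim about the von Neumann round can be checked exhaustively on a small integer model. In the sketch below a result is a nonnegative integer whose `drop` low-order bits are to be discarded, and the error is measured in units of the last retained bit. The function names, the bit widths, and the use of plain round-half-up for "normal rounding" are assumptions made for this illustration; the optional exception for exactly representable results is not implemented.

```python
import math


def truncate(n: int, drop: int) -> int:
    """Discard the low-order bits."""
    return n >> drop


def round_nearest(n: int, drop: int) -> int:
    """Round to nearest, ties upwards (needs a carry propagation)."""
    return (n + (1 << (drop - 1))) >> drop


def von_neumann(n: int, drop: int) -> int:
    """Truncate, then force the least significant retained bit to one (no carry)."""
    return (n >> drop) | 1


def error_stats(rule, drop: int = 6, kept_bits: int = 6):
    """Mean and rms error of `rule`, in units of the last retained bit, over all inputs."""
    errors = [rule(n, drop) - n / 2 ** drop for n in range(1 << (drop + kept_bits))]
    mean = sum(errors) / len(errors)
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mean, rms


if __name__ == "__main__":
    for rule in (round_nearest, von_neumann, truncate):
        mean, rms = error_stats(rule)
        print(f"{rule.__name__:14s} mean = {mean:+.4f}  rms = {rms:.4f}")
```

In this model truncation and the von Neumann round have nearly the same rms error per operation, but truncation has a mean error of about half a unit while the von Neumann round is almost unbiased; the bias is what accumulates in long computations.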
The most accurate practical systems are base 2 with the first fraction bit implicit. If the
accuracy gained by having the first bit implicit is not considered sufficient compensation for the 
disadvantages entailed, then base 4 (or perhaps base 8) is the best choice. 

The accuracy lost by using base 16 or higher is roughly as predicted in Section 5. High bases 
may have some implementation advantages (Anderson et al [1], Atkins [2]). In practice both 
factors should be considered. The number of guard digits used is also important. The use of 
high bases, only one guard digit, and truncation instead of rounding is probably acceptable on 
machines with a long floating-point word. However, to minimize the need for double-precision 
computations, it seems wise to try to squeeze out the last drop of accuracy on a computer with 
a short floating-point word (say 32-40 bits). The amount that can be squeezed out is often 
significant. For example, our simulations show that using system $S_1$ instead of $S_4'$ is roughly
equivalent to carrying one more decimal place.